Notes on NumPy Arrays and pandas Series and DataFrames

We need to import the numpy and pandas libraries before using them in this notebook.


In [1]:
import numpy as np
import pandas as pd

Intro to NumPy Arrays

Here we create a 1-dimensional array of floating-point numbers. Note that we do not have to type print before array, because the notebook displays the value of the last line of a cell automatically.


In [2]:
array = np.array([1,2,3,4],float)
array


Out[2]:
array([ 1.,  2.,  3.,  4.])

Here we create a 2x3 array (2 rows, 3 columns).


In [3]:
two_dimen_array = np.array([[1,2,3],[4,5,6]], float)
two_dimen_array


Out[3]:
array([[ 1.,  2.,  3.],
       [ 4.,  5.,  6.]])
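
As a quick check on the two arrays above, here is a minimal sketch that inspects their element type and dimensions. The dtype, shape and ndim attributes are standard NumPy; the arrays are simply recreated so the snippet runs on its own.

```python
import numpy as np

# Same arrays as in the cells above.
array = np.array([1, 2, 3, 4], float)
two_dimen_array = np.array([[1, 2, 3], [4, 5, 6]], float)

# dtype reports the element type; passing float gives 64-bit floats.
print(array.dtype)            # float64

# shape is a tuple: (rows, columns) for a 2-D array.
print(two_dimen_array.shape)  # (2, 3)

# ndim is the number of dimensions.
print(array.ndim)             # 1
print(two_dimen_array.ndim)   # 2
```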

Now let us show how to perform some slicing and indexing operations on NumPy arrays. Here we print from the 3rd element until the end of the vector array.


In [4]:
array[2:]


Out[4]:
array([ 3.,  4.])

In this line of code, we print all elements from the 2nd element until the end of the 1st row.


In [5]:
two_dimen_array[0][1:]


Out[5]:
array([ 2.,  3.])
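
The cell above indexes in two steps (row first, then slice). As a small sketch, the same elements can also be selected in one step with a comma inside the brackets, which is the style the following cells use. The variable names chained and combined are just for illustration.

```python
import numpy as np

two_dimen_array = np.array([[1, 2, 3], [4, 5, 6]], float)

# Chained indexing: take row 0 first, then slice that 1-D row.
chained = two_dimen_array[0][1:]

# Single-step indexing: row 0, columns 1 onward.
combined = two_dimen_array[0, 1:]

print(chained)
print(combined)
print(np.array_equal(chained, combined))  # True
```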

In this line, we print the 2nd row of the matrix.


In [6]:
two_dimen_array[1, :]


Out[6]:
array([ 4.,  5.,  6.])

In this line, we print the 2nd column of the matrix.


In [7]:
two_dimen_array[:,1]


Out[7]:
array([ 2.,  5.])

Here I wanted to experiment a bit with combining slicing and arithmetic operations. I print the result of subtracting the 2nd row of the matrix from the first three elements of the vector array. The subtraction is done element by element.


In [8]:
array[:3] - two_dimen_array[1,:]


Out[8]:
array([-3., -3., -3.])

Here we create a 2x2 array and display it. Next, we print the first two elements of the vector array.


In [9]:
two_by_two_array = np.array([[1,4],[2,5]],float)
two_by_two_array


Out[9]:
array([[ 1.,  4.],
       [ 2.,  5.]])

In [10]:
array[:2]


Out[10]:
array([ 1.,  2.])

I found this operation a bit surprising, since I assumed that multiplying two arrays would give the dot product, but that is not what occurs here. The * operator multiplies element by element, and NumPy broadcasts the 1-D array across each row of the matrix. As a result, the first element of array multiplies each element of the first column of two_by_two_array, and the 2nd element of array multiplies each element of the 2nd column.


In [11]:
two_by_two_array * array[:2]


Out[11]:
array([[  1.,   8.],
       [  2.,  10.]])
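
To make the column-by-column behaviour concrete, here is a small sketch that rebuilds the same product by hand and compares it with the * result. The names row, broadcast_product and by_hand are made up for this example; row holds the same values as array[:2].

```python
import numpy as np

two_by_two_array = np.array([[1, 4], [2, 5]], float)
row = np.array([1, 2], float)  # same values as array[:2]

# Element-wise product: row is applied to every row of the matrix.
broadcast_product = two_by_two_array * row

# The same result built by hand, scaling one column at a time.
by_hand = np.empty_like(two_by_two_array)
by_hand[:, 0] = two_by_two_array[:, 0] * row[0]
by_hand[:, 1] = two_by_two_array[:, 1] * row[1]

print(broadcast_product)
print(np.array_equal(broadcast_product, by_hand))  # True
```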

If you want to compute the dot product, you have to use NumPy's dot function. Here, we create a new array and compute its dot product with two_by_two_array.


In [12]:
array2 = np.array([1,2],float)
array2


Out[12]:
array([ 1.,  2.])

In [13]:
np.dot(array2,two_by_two_array)


Out[13]:
array([  5.,  14.])
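
As a sanity check on that result, this sketch writes out the sums of products that the dot product computes for each column and compares them with np.dot. The names result, first and second are only for illustration.

```python
import numpy as np

array2 = np.array([1, 2], float)
two_by_two_array = np.array([[1, 4], [2, 5]], float)

# With a 1-D array on the left, np.dot treats it as a row vector.
result = np.dot(array2, two_by_two_array)

# Each output entry is the sum of products of array2 with one column.
first = array2[0] * two_by_two_array[0, 0] + array2[1] * two_by_two_array[1, 0]
second = array2[0] * two_by_two_array[0, 1] + array2[1] * two_by_two_array[1, 1]

print(result)
print(np.allclose(result, [first, second]))                  # True

# The method form on the array gives the same answer.
print(np.array_equal(result, array2.dot(two_by_two_array)))  # True
```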

Intro to pandas Series and DataFrames

A pandas Series is a one-dimensional labeled column that can hold elements of any type. Here I create a pandas Series with information about my father. I have set the index so that it prints descriptive labels instead of the standard number indices.


In [14]:
series = pd.Series(['Ransford', "Hyman Sr.", 'January', 1941], index=['First Name', 'Last Name', 'Birth Month', 'Birth Year'])
series


Out[14]:
First Name      Ransford
Last Name      Hyman Sr.
Birth Month      January
Birth Year          1941
dtype: object
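
A short sketch of what the custom index buys us: values can be looked up by label, and leaving the index argument out falls back to the default numbering. The name default_series is made up for this comparison.

```python
import pandas as pd

series = pd.Series(['Ransford', 'Hyman Sr.', 'January', 1941],
                   index=['First Name', 'Last Name', 'Birth Month', 'Birth Year'])

# With a custom index, values are looked up by label.
print(series['First Name'])  # Ransford
print(series['Birth Year'])  # 1941

# Without the index argument, pandas numbers the entries 0, 1, 2, ...
default_series = pd.Series(['Ransford', 'Hyman Sr.', 'January', 1941])
print(default_series.index.tolist())  # [0, 1, 2, 3]
```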

Here we create a Python dictionary named family and then convert it into a pandas DataFrame named df. Each dictionary key becomes a column, and each list holds that column's values.


In [15]:
family = {'name': ['Ransford','Denzel'],
         'Birth year': [1984, 2004],
         'favorite subject': ['Math','Science']}
family


Out[15]:
{'Birth year': [1984, 2004],
 'favorite subject': ['Math', 'Science'],
 'name': ['Ransford', 'Denzel']}

In [16]:
df = pd.DataFrame(family)
df


Out[16]:
   Birth year  favorite subject      name
0        1984              Math  Ransford
1        2004           Science    Denzel

Notice that when we print out the dictionary it prints out in regular text, but when we print out the DataFrame, it gives us a nice table in IPython. Pretty cool!!!

Now we are going to create a separate DataFrame and play around with some of its member functions.


In [17]:
frank_grades = {'subject':['Math','English','Social Studies','Science','Music','Art'],
               'grades': [95,87,80,96,98,70]}
df2 = pd.DataFrame(frank_grades)
df2


Out[17]:
   grades         subject
0      95            Math
1      87         English
2      80  Social Studies
3      96         Science
4      98           Music
5      70             Art

Notice here that we can select a DataFrame column by its name, just as we look up a value by key in a Python dictionary. pandas DataFrames and Series have a function called describe which generates some useful summary statistics. This function is very helpful for doing some initial sanity checking on the DataFrame's columns. Here we are given the number of entries (the count row), the mean, the standard deviation, the minimum and maximum, and the quartiles (25%, 50% and 75%). The Interquartile Range (IQR) is not listed directly; it is the difference between the 75% and 25% values.


In [18]:
df2['grades'].describe()


Out[18]:
count     6.000000
mean     87.666667
std      10.966616
min      70.000000
25%      81.750000
50%      91.000000
75%      95.750000
max      98.000000
Name: grades, dtype: float64
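
Since describe returns a Series, its rows can be read by label. The sketch below uses that to compute the IQR from the quartiles; the names summary and iqr are only for illustration.

```python
import pandas as pd

frank_grades = {'subject': ['Math', 'English', 'Social Studies', 'Science', 'Music', 'Art'],
                'grades': [95, 87, 80, 96, 98, 70]}
df2 = pd.DataFrame(frank_grades)

# describe() returns a Series, so each statistic can be read by its label.
summary = df2['grades'].describe()

# IQR = third quartile minus first quartile.
iqr = summary['75%'] - summary['25%']
print(iqr)  # 14.0
```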

Here is a way to examine the DataFrame without printing the entire thing. The first printout shows the first 2 rows of the subject column. Note that you can specify how many rows you would like to print by passing the number as a parameter to the head function; the default is 5. The second output shows the function being called on the entire DataFrame object.


In [19]:
print(df2['subject'].head(2))
df2.head()


0       Math
1    English
Name: subject, dtype: object
Out[19]:
   grades         subject
0      95            Math
1      87         English
2      80  Social Studies
3      96         Science
4      98           Music

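Here we call tail on the grades column and on the DataFrame object. Like head, it returns 5 rows by default, this time counted from the end. Note the parentheses on df2.tail(): without them the cell would display the bound method object instead of calling it.


In [20]:
print(df2['grades'].tail())
df2.tail()


1    87
2    80
3    96
4    98
5    70
Name: grades, dtype: int64
Out[20]:
   grades         subject
1      87         English
2      80  Social Studies
3      96         Science
4      98           Music
5      70             Art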
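
A minimal sketch of head and tail with an explicit row count. The last line checks that df2.tail written without parentheses is just a callable method object, not the rows themselves.

```python
import pandas as pd

frank_grades = {'subject': ['Math', 'English', 'Social Studies', 'Science', 'Music', 'Art'],
                'grades': [95, 87, 80, 96, 98, 70]}
df2 = pd.DataFrame(frank_grades)

# Both head and tail take an optional row count (the default is 5).
print(df2.head(3)['subject'].tolist())  # ['Math', 'English', 'Social Studies']
print(df2.tail(2)['subject'].tolist())  # ['Music', 'Art']

# Without parentheses the method is not called; we only get the method itself.
print(callable(df2.tail))  # True
```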

Of course, we probably could have presented this same information in a Python script with comments, but where is the fun in that? I hope that you find this page useful.